# Day 15: Advanced Tokenizer
The tokenizer is the translator between text and the model. Without proper tokenization, even the best model cannot function correctly. Today we cover advanced tokenizer topics frequently encountered in practice.
## encode/decode and Special Tokens
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Hello, how are you?"

# encode: text -> list of token IDs
token_ids = tokenizer.encode(text)
print(f"Token IDs: {token_ids}")
# [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
# 101=[CLS], 102=[SEP] special tokens are automatically added

# decode: token IDs -> text
decoded = tokenizer.decode(token_ids)
print(f"Decoded: {decoded}")

# encode without special tokens
token_ids_no_special = tokenizer.encode(text, add_special_tokens=False)
print(f"Without special tokens: {token_ids_no_special}")

# Inspect individual tokens
tokens = tokenizer.tokenize(text)
print(f"Token list: {tokens}")
# ['hello', ',', 'how', 'are', 'you', '?']
```
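When you decode model output for display, you usually want the special tokens stripped. `decode()` accepts a `skip_special_tokens` flag for exactly this. A minimal sketch, reusing the same `bert-base-uncased` tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
token_ids = tokenizer.encode("Hello, how are you?")

# Default decode keeps [CLS] and [SEP] in the output string
print(tokenizer.decode(token_ids))

# skip_special_tokens=True drops them - usually what you want
# when showing generated text to a user
print(tokenizer.decode(token_ids, skip_special_tokens=True))
```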
## Padding and Truncation Strategies
When processing batches, all inputs must have the same length. Padding extends short inputs, and truncation clips long inputs.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentences = [
    "A short sentence",
    "This is a somewhat longer sentence",
    "This is a very very very long sentence to demonstrate the difference between padding and truncation",
]

# Batch tokenization - apply padding and truncation simultaneously
encoded = tokenizer(
    sentences,
    padding=True,         # Pad to the longest sentence in the batch
    truncation=True,      # Truncate if exceeding max_length
    max_length=20,        # Maximum number of tokens
    return_tensors="pt",  # Return as PyTorch tensors
)

print(f"input_ids shape: {encoded['input_ids'].shape}")
print(f"attention_mask shape: {encoded['attention_mask'].shape}")

# attention_mask: 1 means real token, 0 means padding
# Read the padded length from the tensor instead of hard-coding 20:
# with padding=True the batch width is the longest (truncated) sentence
batch_length = encoded["input_ids"].shape[1]
for i, sent in enumerate(sentences):
    real_tokens = encoded["attention_mask"][i].sum().item()
    print(f"Sentence {i+1}: {real_tokens} real tokens, {batch_length - real_tokens} padding tokens")
```
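Note that `padding=True` pads relative to the current batch, so different batches can end up with different widths. When you need fixed shapes (for example for ONNX export or static-shape accelerators), `padding="max_length"` pads every input to `max_length` regardless of batch contents. A small sketch, again with `bert-base-uncased`:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    ["A short sentence", "A slightly longer example sentence"],
    padding="max_length",  # always pad to max_length, not to the batch maximum
    truncation=True,
    max_length=16,
)

# Every row now has exactly 16 IDs, independent of the batch contents
print([len(ids) for ids in encoded["input_ids"]])
# [16, 16]
```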
## Building Chat Format with chat_template
Modern instruction-tuned LLMs expect conversations in a chat format (system/user/assistant roles). Calling `apply_chat_template()` automatically applies the correct model-specific format.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a friendly AI assistant."},
    {"role": "user", "content": "Tell me 3 advantages of Python."},
]

# Convert chat format to the model-specific prompt format
formatted = tokenizer.apply_chat_template(
    messages,
    tokenize=False,              # Return as string (True returns token IDs)
    add_generation_prompt=True,  # Add assistant response start tag
)
print(formatted)

# Tokenize in one step
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
print(f"Token count: {input_ids.shape[-1]}")
```
Each model ships a different `chat_template`. Llama uses `<|begin_of_text|>` tags, while the ChatML format uses `<|im_start|>` tags. Using `apply_chat_template()` means you do not need to worry about these differences.
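To make the template idea concrete, here is a hand-rolled sketch of the ChatML layout. The `chatml_format` helper is hypothetical and for illustration only; real models ship their template as a Jinja string in `tokenizer.chat_template`, so in practice you should always rely on `apply_chat_template()`:

```python
def chatml_format(messages, add_generation_prompt=True):
    """Illustrative sketch of the ChatML layout: each turn is wrapped in
    <|im_start|>{role} ... <|im_end|> markers."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a friendly AI assistant."},
    {"role": "user", "content": "Tell me 3 advantages of Python."},
]
print(chatml_format(messages))
```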
## Today’s Exercises
- Tokenize the same English sentence with the `gpt2` and `bert-base-uncased` tokenizers, then compare the token count and tokenization approach differences.
- Batch-tokenize 5 sentences of different lengths with `padding="max_length"` and `max_length=32`, then calculate the padding ratio in the `attention_mask` of each sentence.
- Compare the `apply_chat_template()` results of 2 models (e.g., Llama, Mistral) to see how the same conversation is formatted differently for each model.